Abstract

Many classification problems require decisions 
among a large number of competing classes. These 
tasks, however, are not handled well by general pur- 
pose learning methods and are usually addressed in 
an ad-hoc fashion. We suggest a general approach 
- a sequential learning model that utilizes classi- 
fiers to sequentially restrict the number of compet- 
ing classes while maintaining, with high probability, 
the presence of the true outcome in the candidates 
set. Some theoretical and computational properties 
of the model are discussed and we argue that these 
are important in NLP-like domains. The advantages 
of the model are illustrated in an experiment in part- 
of-speech tagging. 

1 Introduction 

A large number of important natural language infer- 
ences can be viewed as problems of resolving ambi- 
guity, either semantic or syntactic, based on proper- 
ties of the surrounding context. These, in turn, can 
all be viewed as classification problems in which 
the goal is to select a class label from among a 
collection of candidates. Examples include part-of-speech
tagging, word-sense disambiguation, accent
restoration, word choice selection in machine trans- 
lation, context-sensitive spelling correction, word 
selection in speech recognition and identifying dis- 
course markers. 

Machine learning methods have become the most popular
technique in a variety of classification problems of this
sort, and have shown significant success. A partial list
consists of Bayesian classifiers (Gale et al., 1993),
decision lists (Yarowsky, 1994), Bayesian hybrids
(Golding, 1995), HMMs (Charniak, 1993), inductive logic
methods (Zelle and Mooney, 1996), memory-based methods
(Zavrel et al., 1997), linear classifiers (Roth, 1998;
Roth, 1999) and transformation-based learning
(Brill, 1995).

In many of these classification problems a significant
source of difficulty is the fact that the number of
candidates is very large: all words in word selection
problems, all possible tags in tagging problems, etc.
Since general purpose learning algorithms do not handle
these multi-class classification problems well (see
below), most of the studies do not address the whole
problem; rather, a small set of candidates (typically
two) is first selected, and the classifier is trained to
choose among these. While this approach allows the
research community to develop better learning methods and
evaluate them in a range of applications, it is important
to realize that a key stage is missing. This could be
significant when the classification methods are to be
embedded as part of a higher level NLP task such as
machine translation or information extraction, where the
small set of candidates the classifier can handle may not
be fixed and could be hard to determine.

In this work we develop a general approach to 
the study of multi-class classifiers. We suggest a se- 
quential learning model that utilizes (almost) gen- 
eral purpose classifiers to sequentially restrict the 
number of competing classes while maintaining, 
with high probability, the presence of the true out- 
come in the candidate set. 

In our paradigm the sought after classifier has to choose
a single class label (or a small set of labels) from among
a large set of labels. It works by sequentially applying
simpler classifiers, each of which outputs a probability
distribution over the candidate labels. These
distributions are multiplied and thresholded, so that each
classifier in the sequence needs to deal with a
(significantly) smaller number of candidate labels than
the previous classifier. The classifiers in the sequence
are selected to be simple in the sense that they typically
work only on part of the feature space, where the
decomposition of the feature space is done so as to
achieve statistical independence. Simple classifiers are
used since they are more likely to be accurate; they are
chosen so that, with high probability (w.h.p.), they have
one sided error, and therefore the presence of the true
label in the candidate set is maintained. The order of the
sequence is determined so as to maximize the rate at which
the size of the candidate label set decreases.

Beyond increased accuracy on multi-class classification
problems, our scheme improves the computation time of
these problems by several orders of magnitude, relative to
other standard schemes.

In this work we describe the approach, discuss an
experiment done in the context of part-of-speech (pos)
tagging, and provide some theoretical justifications for
the approach. Sec. 2 provides some background on
approaches to multi-class classification in machine
learning and in NLP. In Sec. 3 we describe the sequential
model proposed here and in Sec. 4 we describe an
experiment that exhibits some of its advantages. Some
theoretical justifications are outlined in Sec. 5.

2 Multi-Class Classification

Several works within the machine learning community have
attempted to develop general approaches to multi-class
classification. One of the most promising approaches is
that of error correcting output codes (Dietterich and
Bakiri, 1995); however, this approach has not been able to
handle well a large number of classes (over 10 or 15, say)
and its use for most large scale NLP applications is
therefore questionable. Statisticians have studied several
schemes such as learning a single classifier for each of
the class labels (one vs. all) or learning a discriminator
for each pair of class labels, and discussed their
relative merits (Hastie and Tibshirani, 1998). Although it
has been argued that the latter should provide better
results than others, experimental results have been mixed
(Allwein et al., 2000) and in some cases, more involved
schemes, e.g., learning a classifier for each set of three
class labels (and deciding on the prediction in a
tournament like fashion) were shown to perform better
(Teow and Loe, 2000). Moreover, none of these methods seem
to be computationally plausible for large scale problems,
since the number of classifiers one needs to train is, at
least, quadratic in the number of class labels.
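To make the growth in the number of classifiers concrete, the following minimal sketch counts the classifiers needed by the three schemes mentioned above for m class labels. The function name `classifier_counts` is hypothetical, and treating the three-label scheme as one classifier per unordered triple is an assumption made for illustration.

```python
from math import comb


def classifier_counts(m: int) -> dict:
    """Number of classifiers to train for m class labels (illustrative only)."""
    return {
        "one_vs_all": m,             # one classifier per class label
        "all_pairs": comb(m, 2),     # one discriminator per pair of labels
        "all_triples": comb(m, 3),   # assumed: one classifier per set of three labels
    }


if __name__ == "__main__":
    # About 50 tags, as in the pos tagging experiment of Sec. 4.
    print(classifier_counts(50))
```

With about 50 labels the pairwise scheme already needs over a thousand classifiers, which is the scaling concern raised in the text.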



Within NLP, several learning works have already addressed
the problem of multi-class classification. In (Kudoh and
Matsumoto, 2000) the "all pairs" method was used to learn
phrase annotations for shallow parsing. More than 200
different classifiers were used in this task, making it
infeasible as a general solution. All other cases we know
of have taken into account some properties of the domain
and, in fact, several of the works can be viewed as
instantiations of the sequential model we formalize here,
albeit done in an ad-hoc fashion.

In speech recognition, a sequential model is used to
process the speech signal. Abstracting away some details,
the first classifier used is a speech signal analyzer; it
assigns a positive probability only to some of the words
(using Levenshtein distance (Levenshtein, 1966) or
somewhat more sophisticated techniques (Levinson et al.,
1990)). These words are then assigned probabilities using
a different contextual classifier, e.g., a language model,
and then (as done in most current speech recognizers) an
additional sentence level classifier uses the outcome of
the word classifiers in a word lattice to choose the most
likely sentence.

Several word prediction tasks make decisions in a
sequential way as well. In spell correction, confusion
sets are created using a classifier that takes as input
the word transcription and outputs a positive probability
for potential words. In conventional spellers, the output
of this classifier is then given to the user, who selects
the intended word. In context sensitive spelling
correction (Golding and Roth, 1999; Mangu and Brill, 1997)
an additional classifier is then utilized to predict among
words that are supported by the first classifier, using
contextual and lexical information of the surrounding
words. In all studies done so far, however, the output of
the first classifier (the confusion sets) was constructed
manually by the researchers.

Other word prediction tasks have also constructed the list
of confusion sets manually (Lee and Pereira, 1999; Dagan
et al., 1999; Lee, 1999) and justifications were given as
to why this is a reasonable way to construct it.
(Even-Zohar and Roth, 2000) present a similar task in
which the confusion set generation was automated. Their
study also quantified experimentally the advantage of
using early classifiers to restrict the size of the
confusion set.

Many other NLP tasks, such as pos tagging, named entity
recognition and shallow parsing, require multi-class
classifiers. In several of these cases the number of
classes could be very large (e.g., pos tagging in some
languages, pos tagging when a finer proper noun tag is
used). The sequential model suggested here is a natural
solution.

3 The Sequential Model 

We study the problem of learning a multi-class classifier,
$f : X \to C$, where $X \subseteq \{0,1\}^n$,
$C = \{c_1, \ldots, c_m\}$ and $m$ is typically large, on
the order of $10^2$ to $10^5$. We address this problem
using the Sequential Model (SM), in which simpler
classifiers are sequentially used to filter subsets of $C$
out of consideration.

The sequential model is formally defined as a 5-tuple:

$SM = \{ \{X^i\}, C, O, \{f_i\}, \{\epsilon_i\} \}$,

where

• $X = \bigcup_{i=1}^{N} X^i$ is a decomposition of the
domain (not necessarily disjoint; it could be that
$\forall i, X^i = X$).

• $C$ is the set of class labels.

• $O = \{o_1, o_2, \ldots, o_N\}$ determines the order in
which the classifiers are learned and evaluated. For
convenience we denote $f_1 = f_{o_1}, f_2 = f_{o_2},
\ldots$

• $\{f_i\}_{i=1}^{N}$ is the set of classifiers used by
the model, $f_i : X^i \times 2^C \to [0,1]^{|C|}$.

• $\{\epsilon_i\}_{i=1}^{N}$ is a set of constant
thresholds.

Given $x \in X^i$ and a set $C_{i-1}$ of class labels, the
$i$th classifier outputs a probability distribution[1]
$P_i = (p_i(c_1|x), \ldots, p_i(c_m|x))$ over labels in $C$
(where $p_i(c|x)$ is the probability assigned to class $c$
by $f_i$), and $P_i$ satisfies that if $c \notin C_{i-1}$
then $p_i(c|x) = 0$.

The set of remaining candidates after the $i$th
classification stage is determined by $P_i$ and
$\epsilon_i$:

$C_i = \{c \in C \mid p_i(c|x) > \epsilon_i\}$.

The sequential process can be viewed as a multiplication
of distributions. (Hinton, 2000) argues that a product of
distributions (or "experts", PoE) is an efficient way to
make decisions in cases where several different
constraints play a role, and is advantageous over additive
models. In fact, due to the thresholding step, our model
can be viewed as a selective PoE. The thresholding ensures
that the SM has the following monotonicity property:

$\{c \in C \mid p_i(c|x) > \epsilon_i\} \subseteq \{c \in C \mid p_{i-1}(c|x) > \epsilon_{i-1}\}$,

that is, as we evaluate the classifiers sequentially,
confusion sets of smaller or equal size are considered. A
desirable design goal for the SM is that, w.h.p., the
classifiers have one sided error (even at the price of
rejecting fewer classes). That is, if $c_t$ is the true
target[2], then we would like to have that
$p_i(c_t|x) > \epsilon_i$. The rest of this paper presents
a concrete instantiation of the SM, and then provides a
theoretical analysis of some of its properties (Sec. 5).
This work does not address the question of acquiring the
SM, i.e., learning $O$.
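The filtering process defined in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: `table_stage` is a hypothetical stand-in for a learned classifier (a lookup table of scores that is normalized over the current candidates), and all names, scores and thresholds are invented for the example.

```python
from typing import Callable, Dict, Hashable, List, Sequence, Set, Tuple

# A stage plays the role of f_i: it maps (input part x^i, candidate set C_{i-1})
# to a distribution over the candidates.
Stage = Callable[[Hashable, Set[str]], Dict[str, float]]


def table_stage(table: Dict[Hashable, Dict[str, float]]) -> Stage:
    """Hypothetical classifier: look up raw scores, normalize over candidates."""
    def stage(x: Hashable, candidates: Set[str]) -> Dict[str, float]:
        raw = {c: s for c, s in table.get(x, {}).items() if c in candidates}
        total = sum(raw.values())
        return {c: s / total for c, s in raw.items()} if total > 0 else {}
    return stage


def run_sequential_model(
    parts: Sequence[Hashable],
    stages: Sequence[Stage],
    thresholds: Sequence[float],
    labels: Sequence[str],
) -> Tuple[Set[str], List[Set[str]]]:
    """Apply the stages in order, keeping C_i = {c : p_i(c|x) > eps_i}."""
    candidates = set(labels)
    history = [set(candidates)]
    for x_i, f_i, eps_i in zip(parts, stages, thresholds):
        p_i = f_i(x_i, candidates)
        # Labels outside the current candidate set implicitly get probability 0.
        candidates = {c for c in candidates if p_i.get(c, 0.0) > eps_i}
        history.append(set(candidates))
    return candidates, history


if __name__ == "__main__":
    capitalized = table_stage({
        True: {"NNP": 0.7, "NN": 0.2, "JJ": 0.1},
        False: {"NN": 0.4, "VBG": 0.3, "JJ": 0.3},
    })
    suffix = table_stage({"ing": {"VBG": 0.6, "NN": 0.3, "JJ": 0.1}})
    final, history = run_sequential_model(
        [False, "ing"], [capitalized, suffix], [0.05, 0.2],
        ["NN", "NNP", "VBG", "JJ"],
    )
    print(sorted(final), [sorted(h) for h in history])
```

Because each stage can only remove labels from the previous candidate set, the recorded history is monotonically shrinking, which is the monotonicity property stated above. The sketch does not handle the case where a stage rejects every candidate; a real system would need a policy for that.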

4 Example: POS Tagging 

This section describes a two-part experiment of pos
tagging in which we compare, under identical conditions,
two classification models: a SM and a single classifier.
Both are provided with the same input features and the
only difference between them is the model structure.

In the first part, the comparison is done in the 
context of assigning pos tags to unknown words - 
those words which were not presented during train- 
ing and therefore the learner has no baseline knowl- 
edge about possible POS they may take. This ex- 
periment emphasizes the advantage of using the SM 
during evaluation in terms of accuracy. The second 
part is done in the context of pos tagging of known 
words. It compares processing time as well as accu- 
racy of assigning pos tags to known words (that is, 
the classifier utilizes knowledge about possible POS 
tags the target word may take). This part exhibits a 
large reduction in training time using the SM over 
the more common one-vs-all method while the ac- 
curacy of the two methods is almost identical. 

Two types of features, lexical features and contextual
features, may be used when learning how to tag words for
pos. Contextual features capture the information in the
surrounding context and the word lemma, while the lexical
features capture the morphology of the unknown word.[3]
Several issues make the pos tagging problem a natural
problem to study within the SM: (i) a relatively large
number of classes (about 50); (ii) a natural decomposition
of the feature space to contextual and lexical features;
(iii) lexical knowledge (for unknown words) and the word
lemma (for known words) provide, w.h.p., one sided error
(Mikheev, 1997).

[1] The output of many classifiers can be viewed, after
appropriate normalization, as a confidence measure that
can be used as our $P_i$.

[2] We use the terms class and target interchangeably.

[3] Lexical features are used only when tagging unknown
words.



4.1 The Tagger Classifiers 

The domain in our experiment is defined using the
following set of features, all of which are computed
relative to the target word $w_i$.

Contextual Features (as in (Brill, 1995; Roth and
Zelenko, 1998)):

Let $t_{i-1}$ ($t_{i+1}$) be the tag of the word preceding
(following) the target word, respectively.

1. $t_{i-1}$.

2. $t_{i+1}$.

3. $t_{i-2}$.

4. $t_{i+2}$.

5.

6.

7.

8. Baseline tag for word $w_i$. In case $w_i$ is an
unknown word, the baseline is proper singular noun "NNP"
for capitalized words and common singular noun "NN"
otherwise. (This feature is introduced only in some of the
experiments.)

9. The target word $w_i$.

Lexical Features:

Let $\alpha, \beta, \gamma$ be any three characters
observed in the examples.

10. Target word is capitalized.

11. $w_i$ ends with $\alpha$ and length($w_i$) > 3.

12. $w_i$ ends with $\beta\alpha$ and length($w_i$) > 4.

13. $w_i$ ends with $\gamma\beta\alpha$ and
length($w_i$) > 5.
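The lexical features #10-13 can be computed directly from the surface form of the target word. The sketch below is one possible encoding; the function name `lexical_features` and the string format of the emitted features are assumptions made for illustration, not the representation used in the experiments.

```python
from typing import List


def lexical_features(word: str) -> List[str]:
    """Return the active lexical features (#10-13) for a target word."""
    features = []
    # Feature 10: the target word is capitalized.
    if word[:1].isupper():
        features.append("capitalized")
    # Features 11-13: suffix of n characters, active only if the word
    # is longer than the stated minimum length (3, 4 and 5 respectively).
    for n, min_length in ((1, 3), (2, 4), (3, 5)):
        if len(word) > min_length:
            features.append(f"suffix{n}={word[-n:]}")
    return features


if __name__ == "__main__":
    for w in ("Running", "cats", "dog"):
        print(w, lexical_features(w))
```

The length conditions mean that short words activate fewer suffix features, so a three-letter word such as "dog" produces no suffix feature at all.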

In the following experiment, the SM used for unknown words
makes use of three different classifiers, $f_1$, $f_2$ and
$f_3$ or $f_3'$, defined as follows:

$f_1$: a classifier based on the lexical feature #10.

$f_2$: a classifier based on lexical features #11-13.

$f_3$: a classifier based on contextual features #1-9.

$f_3'$: a classifier based on all the features, #1-13.

The SM is compared with a single classifier, either $f_3$
or $f_3'$. Notice that $f_3'$ is a single classifier that
uses the same information as used by the SM. Fig. 1
illustrates the SM that was used in the experiments.

[Figure 1 diagram: sentence, word -> Capitalized -> Suffix -> Context -> prediction]

Figure 1: POS Tagging of Unknown Word using Contextual and
Lexical features in a Sequential Model. The input for the
capitalized classifier has 2 values and therefore 2 ways
to create confusion sets. There are at most 3(26 + 10 + 5)
different inputs for the suffix classifier (26 characters
+ 10 digits + 5 other symbols), therefore the suffix
classifier may emit up to 3(26 + 10 + 5) confusion sets.

All the classifiers in the sequential model, as well as
the single classifier, use the SNoW learning architecture
(Roth, 1998) with the Winnow update rule. SNoW (Sparse
Network of Winnows) is a multi-class classifier that is
specifically tailored for learning in domains in which the
potential number of features taking part in decisions is
very large, but in which decisions actually depend on a
small number of those features. SNoW works by learning a
sparse network of linear functions over a pre-defined or
incrementally learned feature space. SNoW has already been
used successfully on several tasks in natural language
processing (Roth, 1998; Roth and Zelenko, 1998; Golding
and Roth, 1999; Punyakanok and Roth, 200).

Specifically, for each class label $c$ SNoW learns a
function $f_c : X \to [0,1]$ that maps a feature based
representation $x$ of the input instance to a number
$a_c(x) \in [0,1]$ which can be interpreted as the
probability of $c$ being the class label corresponding to
$x$. At prediction time, given $x \in X$, SNoW outputs

$SNoW(x) = \max_c \{a_c(x)\}$.   (1)

All functions - in our case, 50 target nodes are 
used, one for each pos tag - reside over the same 
feature space, but can be thought of as autonomous 
functions (networks). That is, a given example is 
treated autonomously by each target subnetwork; an 
example labeled t is considered as a positive exam- 
ple for the function learned for t and as a negative 
example for the rest of the functions (target nodes). 
The network is sparse in that a target node need not 
be connected to all nodes in the input layer. For ex- 
ample, it is not connected to input nodes (features) 
that were never active with it in the same sentence. 

Although SNoW is used with 50 different targets, the SM
utilizes it by determining the confusion set dynamically.
That is, in evaluation (prediction), the maximum in Eq. 1
is taken only over the currently applicable confusion set.
Moreover, in training, a given example is used to train
only target networks that are in the currently applicable
confusion set. That is, an example that is positive for
target $t$ is viewed as positive for this target (if it is
in the confusion set), and as negative for the other
targets in the confusion set. All other targets do not see
this example.
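The dynamic use of the confusion set in prediction and in training can be sketched as follows. This is an illustrative sketch only: it does not use the SNoW software, the function names `restricted_prediction` and `training_roles` are hypothetical, and the activations are assumed to be given as a plain dictionary from tags to numbers.

```python
from typing import Dict, Set


def restricted_prediction(activations: Dict[str, float], confusion_set: Set[str]) -> str:
    """Take the maximum of Eq. 1 only over the currently applicable confusion set.

    Raises ValueError if the confusion set is empty.
    """
    # Sorting makes tie-breaking deterministic for this sketch.
    return max(sorted(confusion_set), key=lambda c: activations.get(c, 0.0))


def training_roles(true_tag: str, confusion_set: Set[str]) -> Dict[str, int]:
    """Decide which target networks see an example, and with which label.

    Targets in the confusion set get -1 (negative example); the true tag gets
    +1 if it is in the confusion set. Targets outside the set are absent from
    the result, i.e. they do not see the example at all.
    """
    roles = {c: -1 for c in confusion_set}
    if true_tag in confusion_set:
        roles[true_tag] = +1
    return roles


if __name__ == "__main__":
    activations = {"NN": 0.4, "VBG": 0.9, "JJ": 0.2}
    print(restricted_prediction(activations, {"NN", "JJ"}))
    print(training_roles("NN", {"NN", "JJ"}))
```

In the usage example, "VBG" has the highest activation overall but is never returned, because it is not in the confusion set; likewise it receives no training signal from this example.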

The case of POS tagging of known words is handled in a
similar way. In this case, all possible tags are known. In
training, we record, for each word $w_i$, all pos tags
with which it was tagged in the training corpus. During
evaluation, whenever word $w_i$ occurs, it is tagged with
one of these pos tags. That is, in evaluation, the
confusion set consists only of those tags observed with
the target word in training, and the maximum in Eq. 1 is
taken only over these. This is always the case when using
$f_3$ (or $f_3'$), both in the SM and as a single
classifier. In training, though, for the sake of this
experiment, we treat $f_3$ ($f_3'$) differently depending
on whether it is trained for the SM or as a single
classifier. When trained as a single classifier (e.g.,
(Roth and Zelenko, 1998)), $f_3$ uses each $t$-tagged
example as a positive example for $t$ and a negative
example for all other tags. On the other hand, the SM
classifier is trained on a $t$-tagged example of word $w$
by using it as a positive example for $t$ and a negative
example only for the effective confusion set, that is,
those pos tags which have been observed as tags of $w$ in
the training corpus.



4.2 Experimental Results 

The data for the experiments was extracted from the Penn
Treebank WSJ and Brown corpora. The training corpus
consists of 2,400,000 words. The test corpus consists of
280,000 words, of which 5,412 are unknown words (that is,
they do not occur in the training corpus). Numbers (the
pos "CD") are not included among the unknown words.

POS Tagging of Unknown Words

$f_3$      $f_3$ + baseline      baseline
8.6        61.8                  60.8

Table 1: POS tagging of unknown words using contextual
features (accuracy in percent). $f_3$ is a classifier
that uses only contextual features, $f_3$ + baseline is
the same classifier with the addition of the baseline
feature ("NNP" or "NN").

Table 1 summarizes the results of the experiments with a
single classifier that uses only contextual features.
Notice that adding the baseline POS significantly improves
the results but not much is gained over the baseline. The
reason is that the baseline feature is almost perfect
(94.4%) in the training data. For that reason, in the next
experiments we do not use the baseline at all, since it
could hide the phenomenon addressed. (In practice, one
might want to use a more sophisticated baseline, as in
(Dermatas and Kokkinakis, 1995).)



$f_3$      $f_3'$      SM($f_1, f_2, f_3$)      SM($f_1, f_2, f_3'$)
8.6        56.1        65.7                     73.0

Table 2: POS tagging of unknown words using contextual
and lexical features (accuracy in percent). $f_3$ is
based only on contextual features, $f_3'$ is based on
contextual and lexical features. SM($f_i, f_j$) denotes
that $f_j$ follows $f_i$ in the sequential model.

Table 2 summarizes the results of the main experiment in
this part. It exhibits the advantage of using the SM
(columns 3, 4) over a single classifier that makes use of
the same feature set (column 2). In both cases, all
features are used. In $f_3'$, a classifier is trained on
input that consists of all these features and chooses a
label from among all class labels. In SM($f_1, f_2, f_3$)
the same features are used as input, but different
classifiers are used sequentially, each using only part of
the feature space and restricting the set of possible
outcomes available to the next classifier in the sequence:
$f_j$ chooses only from among those left as candidates.



It is interesting to note that further improvement can be
achieved, as shown in the rightmost column. Given that the
last stage in SM($f_1, f_2, f_3'$) is identical to the
single classifier $f_3'$, this shows the contribution of
the filtering done in the first two stages using $f_1$ and
$f_2$. In addition, this result shows that the input
spaces of the classifiers need not be disjoint.

POS Tagging of Known Words 

Essentially everyone who is learning a POS tagger for
known words makes use of a "sequential model" assumption
during evaluation, by restricting the set of candidates,
as discussed in Sec. 4.1. The focus of this experiment is
thus to investigate the advantage of the SM during
training. In this case, a single (one-vs-all) classifier
trains each tag against all other tags, while a SM
classifier trains it only against the effective confusion
set (Sec. 4.1).

Table 3 compares the performance of the $f_3$ classifier
trained using a one-vs-all method to the same classifier
trained the SM way. The results are only for known words,
and the results of Brill's tagger (Brill, 1995) are
presented for comparison.

one-vs-all      $SM_{train}$      Brill
96.88           96.86             96.49

Table 3: POS Tagging of known words using contextual
features (accuracy in percent). one-vs-all denotes
training where example $x$ serves as a positive example
for the true tag and as a negative example for all the
other tags. $SM_{train}$ denotes training where example
$x$ serves as a positive example for the true tag and as a
negative example only for a restricted set of tags, based
on a previous classifier; here, a simple baseline
restriction.

While, in principle (see Sec. 5), the SM should do better
(and never worse) than the one-vs-all classifier, we
believe that in this case the SM does not have any
performance advantage, since the classifiers work in a
very high dimensional feature space which allows the
one-vs-all classifier to find a hyperplane that separates
the positive examples from many different kinds of
negative examples (even irrelevant ones).

However, the key advantage of the SM in this case is the
significant decrease in computation time, both in training
and evaluation. Table 4 shows that in the pos tagging
task, training using the SM is 6 times faster than with a
one-vs-all method and 3000 times faster than Brill's
learner. In addition, our tagger was about twice as fast
as Brill's tagger in evaluation.

          one-vs-all        $SM_{train}$      Brill
Train     1877.3            313.5             > 10^6
Test      2.3 * 10^-2       4.3 * 10^-3

Table 4: Processing time for POS tagging of known words
using contextual features (in CPU seconds). Train:
training time over $10^5$ sentences. Brill's learner was
interrupted after 12 days of training (default threshold
was used). Test: average number of seconds to evaluate a
single sentence. All runs were done on the same machine.
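The speedup factors quoted in the text follow from the training times in Table 4. The short calculation below reproduces them; the variable names are hypothetical, and the figure for Brill's learner is only a lower bound derived from the 12 days after which the run was interrupted, so the resulting ratio is a lower bound as well.

```python
# Training times in CPU seconds, taken from Table 4.
one_vs_all_train = 1877.3
sm_train = 313.5

# Brill's learner was interrupted after 12 days, so this is a lower bound.
brill_train_lower_bound = 12 * 24 * 60 * 60

speedup_vs_one_vs_all = one_vs_all_train / sm_train
speedup_vs_brill = brill_train_lower_bound / sm_train

if __name__ == "__main__":
    print(f"SM vs one-vs-all: {speedup_vs_one_vs_all:.2f}x")
    print(f"SM vs Brill (lower bound): {speedup_vs_brill:.0f}x")
```

The first ratio is just under 6 and the second is a little over 3000, consistent with the rounded figures in the text.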



5 The Sequential Model: Theoretical Justification

In this section, we discuss some of the theoretical 
aspects of the SM and explain some of its advan- 
tages. In particular, we discuss the following issues: 

1. Domain Decomposition: When the input fea- 
ture space can be decomposed, we show that it 
is advantageous to do it and learn several clas- 
sifiers, each on a smaller domain. 

2. Range Decomposition: Reducing confusion 
set size is advantageous both in training and 
testing the classifiers. 

(a) Test: Smaller confusion set is shown to 
yield a smaller expected error. 

(b) Training: Under the assumptions that a 
small confusion set (determined dynam- 
ically by previous classifiers in the se- 
quence) is used when a classifier is eval- 
uated, it is shown that training the classi- 
fiers this way is advantageous. 

3. Expressivity: SM can be viewed as a way to
generate an expressive classifier by building
on a number of simpler ones. We argue that
the SM way of generating an expressive
classifier has advantages over other ways of
doing it, such as decision trees (Sec. 5.3).



In addition, SM has several significant computa- 
tional advantages both in training and in test, since 
it only needs to consider a subset of the set of can- 
didate class labels. We will not discuss these issues 
in detail here. 



5.1 Decomposing the Domain 

Decomposing the domain is not an essential part of the SM;
it is possible that all the classifiers used actually use
the same domain. As we show below, though, when a
decomposition is possible, it is advantageous to use it.

It is shown in Eqs. 2-7 that when it is possible to
decompose the domain to subsets that are conditionally
independent given the class label, the SM with classifiers
defined on these subsets is as accurate as the optimal
single classifier. (In fact, this is shown for a pure
product of simpler classifiers; the SM uses a selective
product.)

In the following we assume that $X^1, \ldots, X^N$ provide
a decomposition of the domain $X$ (Sec. 3) and that
$(x^1, \ldots, x^N) \in (X^1, \ldots, X^N)$. By
conditional independence we mean that

$$p(x^1, \ldots, x^j | c) = \prod_{k=1}^{j} p(x^k | c),$$

where $x^k$ is the input for the $k$th classifier.

$$\begin{aligned}
\arg\max_{c \in C} p(c|x)
 &= \arg\max_{c \in C} p(c|x^1, \ldots, x^N) && (2) \\
 &= \arg\max_{c \in C} \frac{p(x^1, \ldots, x^N|c) \cdot p(c)}{p(x^1, \ldots, x^N)} && (3) \\
 &= \arg\max_{c \in C} p(x^1, \ldots, x^N|c) \cdot p(c) && (4) \\
 &= \arg\max_{c \in C} p(x^1|c) \cdots p(x^N|c) \cdot p(c) && (5) \\
 &= \arg\max_{c \in C} \frac{p(c|x^1) p(x^1)}{p(c)} \cdots \frac{p(c|x^N) p(x^N)}{p(c)} \cdot p(c) && (6) \\
 &= \arg\max_{c \in C} p(c|x^1) \cdots p(c|x^N) \cdot \frac{1}{p(c)^{N-1}} && (7)
\end{aligned}$$



$p(x^1, \ldots, x^N)$ in Eq. 3 is identical
$\forall c \in C$ and therefore can be treated as a
constant. Eq. 5 is derived by applying the independence
assumption. Eq. 6 is derived by using the Bayes rule for
each term $p(x^i|c)$ separately, and Eq. 7 follows
because the factors $p(x^i)$ do not depend on $c$.
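The equivalence between the joint form (Eq. 5) and the product-of-posteriors form (Eq. 7) can be checked numerically on a small example. The sketch below assumes three classes and two binary inputs that are conditionally independent given the class; all probabilities and names are invented for illustration.

```python
import itertools

# Hypothetical prior p(c) over three classes.
prior = {"a": 0.5, "b": 0.3, "c": 0.2}
# Hypothetical p(x^k = 1 | c) for two binary inputs, assumed conditionally
# independent given the class.
p_x1 = {"a": 0.9, "b": 0.2, "c": 0.5}
p_x2 = {"a": 0.1, "b": 0.7, "c": 0.6}


def likelihood(table, c, value):
    """p(x^k = value | c) for a binary input."""
    return table[c] if value == 1 else 1.0 - table[c]


def joint_argmax(x1, x2):
    """Eq. 5: argmax_c p(x^1|c) p(x^2|c) p(c)."""
    return max(prior, key=lambda c: likelihood(p_x1, c, x1) * likelihood(p_x2, c, x2) * prior[c])


def posterior(table, value):
    """p(c | x^k = value), obtained with Bayes' rule."""
    unnormalized = {c: likelihood(table, c, value) * prior[c] for c in prior}
    z = sum(unnormalized.values())
    return {c: u / z for c, u in unnormalized.items()}


def product_argmax(x1, x2):
    """Eq. 7 with N = 2: argmax_c p(c|x^1) p(c|x^2) / p(c)^(N-1)."""
    q1, q2 = posterior(p_x1, x1), posterior(p_x2, x2)
    return max(prior, key=lambda c: q1[c] * q2[c] / prior[c])


if __name__ == "__main__":
    for x1, x2 in itertools.product([0, 1], repeat=2):
        print((x1, x2), joint_argmax(x1, x2), product_argmax(x1, x2))
```

For every input combination the two decision rules select the same class, as the derivation predicts, since the two scores differ only by the factor $p(x^1) p(x^2)$, which does not depend on the class.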

We note that although the conditional independence
assumption is a strong one, it is a reasonable assumption
in many NLP applications; in particular, when cross
modality information is used, this assumption typically
holds for a decomposition that is done across modalities.
For example, in POS tagging, lexical information is often
conditionally independent of contextual information, given
the true POS. (E.g., assume that the word is a gerund;
then the context is independent of the "ing" word ending.)

In addition, decomposing the domain has significant
advantages from the learning theory point of view
(Roth, 1999). Learning over domains of lower
dimensionality implies better generalization bounds or,
equivalently, more accurate classifiers for a fixed size
training set.

5.2 Decomposing the range 

The SM attempts to reduce the size of the candidates 
set. We justify this by considering two cases: (i) 
Test: we will argue that prediction among a smaller 
set of classes has advantages over predicting among 
a large set of classes; (ii) Training: we will argue 
that it is advantageous to ignore irrelevant examples. 

5.2.1 Decomposing the range during Test 

The following discussion formalizes the intuition that a
smaller confusion set is preferred. Let $f : X \to C$ be
the true target function and $p(c_j|x)$ the probability
assigned by the final classifier to class $c_j \in C$
given example $x \in X$. Assuming that the prediction is
done, naturally, by choosing the most likely class label,
we see that the expected error when using a confusion set
$K$ of size $k$ is:

$$Error_K = E_x[(\arg\max_{1 \le j \le k} p(c_j|x)) \ne f(x)] = P((\arg\max_{1 \le j \le k} p(c_j|x)) \ne f(x)) \quad (8)$$

Now we have:

Claim 1 Let $K = \{c_1, \ldots, c_k\}$,
$K' = \{c_1, \ldots, c_{k+r}\}$ be two sets of class
labels and assume $f(x) \in K$ for example $x$. Then
$Error_K \le Error_{K'}$.

Proof. Denote:

$$pe(a, b, f) = P((\arg\max_{a \le j \le b} p(c_j|x)) \ne f(x)).$$

Then,

$$\begin{aligned}
Error_{K'} &= E_x[(\arg\max_{1 \le j \le k+r} p(c_j|x)) \ne f(x)] \\
 &= pe(1, k+r, f) \\
 &= pe(1, k, f) + (1 - pe(1, k, f)) \, pe(k+1, k+r, f) \\
 &= Error_K + (1 - Error_K) \, pe(k+1, k+r, f) \\
 &\ge Error_K
\end{aligned}$$

Claim 1 shows that reducing the size of the confusion set
can only help; this holds under the assumption that the
true class label is not eliminated from consideration by
downstream classifiers, that is, under the one-sided error
assumption. Moreover, it is easy to see that the proof of
Claim 1 allows us to relax the one sided error assumption
and assume instead that the previous classifiers err with
a probability which is smaller than:

$$(1 - Error_K) \cdot pe(k+1, k+r, f).$$
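Claim 1 can be illustrated with a small simulation. The sketch below assumes a hypothetical noisy final classifier whose scores are uniform random numbers with a fixed bonus for the true label; the true label always lies among the first k labels, matching the assumption $f(x) \in K$. The names `restricted_argmax` and `error_rates` and all constants are assumptions for illustration.

```python
import random


def restricted_argmax(scores, k):
    """Index of the highest score among the first k labels."""
    return max(range(k), key=lambda j: scores[j])


def error_rates(k, r, trials=2000, seed=0):
    """Empirical error with confusion set K (size k) vs. K' (size k + r)."""
    rng = random.Random(seed)
    errors_small = errors_large = 0
    for _ in range(trials):
        truth = rng.randrange(k)              # the true label lies in K
        scores = [rng.random() for _ in range(k + r)]
        scores[truth] += 0.5                  # noisy classifier that favours the truth
        errors_small += restricted_argmax(scores, k) != truth
        errors_large += restricted_argmax(scores, k + r) != truth
    return errors_small / trials, errors_large / trials


if __name__ == "__main__":
    for k, r in ((2, 3), (5, 20), (10, 40)):
        small, large = error_rates(k, r)
        print(f"k={k}, r={r}: Error_K={small:.3f}, Error_K'={large:.3f}")
```

Whenever the prediction over the larger set is correct, the true label has the highest score overall and therefore also the highest score within the smaller set, so the empirical error over $K$ never exceeds the error over $K'$ in this simulation.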

5.2.2 Decomposing the range during training 

We will assume now, as suggested by the previous
discussion, that in the evaluation stage the smallest
possible set of candidates will be considered by each
classifier. Based on this assumption, Claim 2 shows that
training this way is advantageous. That is, utilizing the
SM in training yields a better classifier.

Let $A$ be a learning algorithm that is trained to
minimize:

$$\int_{x \in X} L(y \cdot h(x)) \, p(x) \, dx,$$

where $x$ is an example, $y \in \{-1, +1\}$ is the true
class, $h$ is the hypothesis, $L$ is a loss function and
$p(x)$ is the probability of seeing example $x$ when
$x \sim P$ (see (Allwein et al., 2000)). (Notice that in
this section we are using a general loss function $L$; we
could use, in particular, the binary loss function used in
Sec. 5.2.1.) We phrase and prove the next claim, w.l.o.g.,
for the case of 2 vs. 3 class labels.

Claim 2 Let $C = \{c_1, c_2, c_3\}$ be the set of class
labels, and let $S_i$ be the set of examples for class
$i$. Assume a sequential model in which class $c_1$ does
not compete with class $c_3$. That is, whenever
$x \in S_1$ the SM filters out $c_3$ such that the final
classifier ($f_N$) considers only $c_1$ and $c_2$. Then,
the error of the hypothesis produced by algorithm $A$ (for
$f_N$) when trained on examples in $\{S_1, S_2\}$ is no
larger than the error of the hypothesis it produces when
trained on examples in $\{S_1, S_2, S_3\}$.

Proof. Assume that the algorithm $A$, when trained on a
sample $S$, produces a hypothesis that minimizes the
empirical error over $S$.

Denote $x \sim P_C$ when $x$ is sampled according to a
distribution that supports only examples with label in
$C$. Let $S$ be a sample set of size $m$, sampled
according to $P_{1,2}$, and $h'$ the hypothesis produced
by $A$. Then, for all $h \ne h'$,

$$\frac{1}{m} \sum_{x \in S} L(y h'(x)) \le \frac{1}{m} \sum_{x \in S} L(y h(x)) \quad (9)$$

In the limit, as $m \to \infty$,

$$\int_{x \sim P_{1,2}} L(y h'(x)) \, p(x) \, dx \le \int_{x \sim P_{1,2}} L(y h(x)) \, p(x) \, dx.$$

In particular this holds if $h$ is a hypothesis produced
by $A$ when trained on $S'$, that is sampled according to
$x \sim P_{1,2,3}$. ■

5.3 Expressivity 

The SM is a decision process that is conceptually
similar to a decision tree process (Rasoul
and Landgrebe, 1991; Mitchell, 1997), especially
if one allows more general classifiers in the decision
tree nodes. In this section we show that (i) the
SM can express any DT, and (ii) the SM is more
compact than a decision tree even when the DT makes
use of more expressive internal nodes (Murthy and
Salzberg, 1994).

The next theorem shows that for a fixed set of 
functions (queries) over the input features, any bi- 
nary decision tree can be represented as a SM. Ex- 
tending the proof beyond binary decision trees is 
straight-forward. 

Theorem 3 Let T be a binary decision tree with N
internal nodes. Then, there exists a sequential model
S such that S and T have the same size, and they
produce the same predictions.

Proof (Sketch): Given a decision tree T on N 
nodes we show how to construct a SM that produces 
equivalent predictions. 

1. Generate a confusion set C that consists of N
classes, each representing an internal node in
T.

2. For each internal node d ∈ T, assign a classifier
f_i : X × C → [0, 1]m-l+M_

3. Order the classifiers f_1, ..., f_n such that a
classifier that is assigned to node d is processed
before any classifier that was assigned to any
of the children of d.



4. Define each classifier f_i that was assigned to
node d ∈ T to have an influence on the
outcome iff node d ∈ T lies on the path
(b_0, b_1, ...) from the root to the predicted
class.

5. Show that using steps 1-4, the predicted targets
of T and S are identical.

This completes the proof and shows that the resulting
SM is of equivalent size to the original decision
tree.
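
The construction can be sketched in Python as follows. This is a
loose illustration rather than the formal construction: the
`Node` type, the numeric input, and the idea of tracking an
"active" node in place of a distribution over the confusion set
are simplifying assumptions introduced for the example. Each
internal node receives one classifier, classifiers are ordered so
that parents precede their children, and a classifier changes
the outcome only when its node lies on the path followed by the
input.

```python
from dataclasses import dataclass
from typing import Callable, Union

@dataclass(eq=False)
class Node:
    # Hypothetical internal node: a binary query and two children,
    # each either another Node or a leaf label (a string).
    query: Callable[[float], bool]
    left: Union["Node", str]    # followed when the query is False
    right: Union["Node", str]   # followed when the query is True

def tree_predict(root, x):
    node = root
    while isinstance(node, Node):
        node = node.right if node.query(x) else node.left
    return node

def tree_to_sm(root):
    # Steps 1-3: one classifier per internal node, parents before children
    # (a pre-order traversal guarantees this ordering).
    order, stack = [], [root]
    while stack:
        d = stack.pop()
        if isinstance(d, Node):
            order.append(d)
            stack.extend([d.right, d.left])

    def make_classifier(d):
        # Step 4: the classifier for node d acts only if d is on the path.
        def f(x, active):
            if active is not d:
                return active
            return d.right if d.query(x) else d.left
        return f

    return [make_classifier(d) for d in order]

def sm_predict(classifiers, root, x):
    active = root
    for f in classifiers:       # classifiers are applied in sequence
        active = f(x, active)
    return active

# Toy tree with N = 3 internal nodes and four leaf labels.
tree = Node(lambda x: x > 5,
            Node(lambda x: x > 2, "low", "mid"),
            Node(lambda x: x > 8, "high", "top"))
sm = tree_to_sm(tree)
```

On this toy tree the sequence contains three classifiers, one per
internal node, and `sm_predict(sm, tree, x)` follows the same
decisions as `tree_predict(tree, x)`.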

We note that given a SM, it is also relatively easy 
(details omitted) to construct a decision tree that 
produces the same decisions as the final classifier of 
the SM. However, the simple construction results in 
a decision tree that is exponentially larger than the 
original SM. Theorem 4 shows that this difference
in expressivity is inherent. 

Theorem 4 Let N be the number of classifiers in
a sequential model S and the number of internal
nodes in a decision tree T. Let m be the number
of classes in the output of S and also the maximum
degree of the internal nodes in T. Denote by
F(T), F(S) the number of functions representable
by T, S respectively. Then, when m ≫ N, F(S)
is exponentially larger than F(T).

Proof (Sketch): The proof follows by counting
the number of functions that can be represented
using a decision tree with N internal nodes (Wilf,
1994), and the number of functions that can be
represented using a sequential model with N
intermediate classifiers. Given the exponential gap,
it follows that one may need exponentially large
decision trees to represent a predictor equivalent
to a SM of size N.

6 Conclusion 

A wide range and a large number of classifica- 
tion tasks will have to be used in order to perform 
any high level natural language inference such as 
speech recognition, machine translation or question 
answering. Although in each instantiation the real 
conflict could be only to choose among a small set 
of candidates, the original set of candidates could be 
very large; deriving the small set of candidates that 
are relevant to the task at hand may not be immedi- 
ate. 

This paper addressed this problem by developing
a general paradigm for multi-class classification that
sequentially restricts the set of candidate classes to
a small set, in a way that is driven by the data
observed. We have described the method and provided
some justifications for its advantages, especially in
NLP-like domains. Preliminary experiments also
show promise.

Several issues are still missing from this work.
In our experimental study the decomposition of the
feature space was done manually; it would be desirable
to develop methods to do this automatically. A better
understanding of methods for thresholding the
probability distributions that the classifiers output,
as well as principled ways to order the classifiers,
are also among the future directions of this research.